Abstract



We begin this report by describing the Probably Approximately Correct (PAC) model
for learning a concept class, consisting of subsets of a domain, and a function class,
consisting of functions from the domain to the unit interval. Two combinatorial
parameters, the Vapnik-Chervonenkis (VC) dimension and its generalization, the Fat
Shattering dimension of scale $\epsilon$, are explained and a few examples of their calculations
are given with proofs. We then explain Sauer's Lemma, which involves the VC
dimension and is used to prove that a concept class is distribution-free PAC
learnable if and only if it has finite VC dimension.

As the main new result of our research, we explore the construction of a new
function class, obtained by forming compositions with a continuous logic connective,
a uniformly continuous function from the unit hypercube to the unit interval, from a
collection of function classes. Vidyasagar proved that such a composition function
class has finite Fat Shattering dimension of all scales if the classes in the original
collection do; however, no estimates of the dimension were known. Using results
by Mendelson-Vershynin and Talagrand, we bound the Fat Shattering dimension of
scale $\epsilon$ of this new function class in terms of the Fat Shattering dimensions of the
collection's classes.

We conclude this report by providing a few open questions and future research
topics involving the PAC learning model.

1 Introduction



In the area of statistical learning theory, the Probably Approximately Correct (PAC)
learning model formalizes the notion of learning by using sample data points to
produce valid hypotheses through algorithms. For instance, the following illustrates
one learning problem which can be formalized in the PAC model. Suppose that a
disease affects certain people and that, out of 100 people in a hospital, 12 are
sick with this disease. Is there a way to predict whether any given person
in the hospital has the disease or not?

This report covers the PAC learning model applied to learning a collection of
subsets $\mathcal{C}$, called a concept class, of a domain $X$ and more generally, a collection of
functions $\mathcal{F}$, called a function class, from $X$ to the unit interval $[0,1]$. The report
involves mostly concepts from analysis and some concepts from probability theory,
but only the completion of the first two years of undergraduate studies in mathematics
is assumed of the reader.

Report outline 

First, we give two definitions of PAC learning, one for a concept class $\mathcal{C}$ and the
other for a function class $\mathcal{F}$, and explore two combinatorial parameters, the
Vapnik-Chervonenkis (VC) dimension and the Fat Shattering dimension of scale $\epsilon$, for $\mathcal{C}$ and
$\mathcal{F}$, respectively. Then, we explain Sauer's Lemma, a theorem which involves the VC
dimension of $\mathcal{C}$ and is used to prove that the finiteness of this dimension is a sufficient
condition for $\mathcal{C}$ to be learnable.

Finally, as the main new result of our research, given function classes $\mathcal{F}_1, \dots, \mathcal{F}_k$
and a "continuous logic connective" (that is, a continuous function $u : [0,1]^k \to [0,1]$),
we consider the construction of a new composition function class $u(\mathcal{F}_1, \dots, \mathcal{F}_k)$,
consisting of functions $u(f_1, \dots, f_k)$ defined by $u(f_1, \dots, f_k)(x) = u(f_1(x), \dots, f_k(x))$
for $f_i \in \mathcal{F}_i$. We then bound the Fat Shattering dimension of scale $\epsilon$ of this class in
terms of a sum of the Fat Shattering dimensions of scale $\delta(\epsilon,k)$ of $\mathcal{F}_1, \dots, \mathcal{F}_k$, where
$\delta(\epsilon,k)$ only depends on $\epsilon$ and $k$. There is a previously known analogous estimate for
a composition of concept classes built using a usual connective of classical logic [18].
We deduce our new bound using results from Mendelson-Vershynin and Talagrand.

Before jumping into the PAC learning model, we provide some basic terminology 
and results from analysis and measure theory. From now on, any propositions or 
examples given with proofs, unless mentioned otherwise, are done by us and are 
independent of any sources.

2 Brief Overview of Analysis and Measure Theory



This section lists some definitions and results in measure theory and analysis, found 
in standard textbooks, such as [6], [18], and [2], which are used in this report. 

Probability space 

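Definition 2.1. Let $X$ be a set. A $\sigma$-algebra $\mathcal{S}$ is a non-empty collection of subsets
of $X$ such that the following are satisfied:

1. If $A \in \mathcal{S}$, then $X \setminus A \in \mathcal{S}$.

2. If $A_i \in \mathcal{S}$ for all $i \in \mathbb{N}$, then $\bigcup_{i \in \mathbb{N}} A_i \in \mathcal{S}$.

If $\mathcal{S}$ is a $\sigma$-algebra, then the pair $(X, \mathcal{S})$ is called a measurable space.

Definition 2.2. Suppose $(X, \mathcal{S})$ and $(Y, \mathcal{T})$ are two measurable spaces. A function
$f : X \to Y$ is called measurable if $f^{-1}(T) \in \mathcal{S}$ for all $T \in \mathcal{T}$.

Definition 2.3. Given a measurable space $(X, \mathcal{S})$, a function
$\mu : \mathcal{S} \to \mathbb{R}^+ = \{r \in \mathbb{R} : r \geq 0\}$ is a measure if the following hold:

1. $\mu(\emptyset) = 0$

2. If $A_i \in \mathcal{S}$ for all $i \in \mathbb{N}$ and $A_i \cap A_j = \emptyset$ whenever $i \neq j$, then

$$\mu\Big(\bigcup_{i \in \mathbb{N}} A_i\Big) = \sum_{i \in \mathbb{N}} \mu(A_i).$$

The triple $(X, \mathcal{S}, \mu)$ is called a measure space. If in addition, $\mu$ satisfies $\mu(X) = 1$,
then $\mu$ is a probability measure and $(X, \mathcal{S}, \mu)$ is called a probability space.

Given a probability space $(X, \mathcal{S}, \mu)$, one can measure the difference between two
subsets $A, B \in \mathcal{S}$ of $X$ by looking at their symmetric difference $A \,\Delta\, B$, which is
indeed in $\mathcal{S}$:

$$\mu(A \,\Delta\, B) = \mu((A \cup B) \setminus (A \cap B)) = \mu(((X \setminus A) \cap B) \cup (A \cap (X \setminus B))).$$

More generally, given two measurable functions $f, g : X \to [0,1]$, one can look at the
expected value of their absolute difference by integrating with respect to $\mu$:

$$E_\mu(f, g) = \int_X |f(x) - g(x)| \, d\mu(x).$$

This report does not go into any details involving the Lebesgue integral but does
assume that integration of measurable functions into the real numbers (which form a
measure space) makes sense and is linear and order-preserving:

$$\int_X (r f(x) + s g(x)) \, d\mu(x) = r \int_X f(x) \, d\mu(x) + s \int_X g(x) \, d\mu(x)$$

for all $r, s \in \mathbb{R}$, and

$$\int_X f(x) \, d\mu(x) \leq \int_X g(x) \, d\mu(x)$$

if $f(x) \leq g(x)$ for all $x \in X$.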

Validating hypotheses in the PAC learning model uses the idea of measuring the
symmetric difference of two subsets of a probability space $(X, \mathcal{S}, \mu)$ and calculating
the expected value of the difference of $f, g : X \to [0,1]$. The structure of metric
spaces arises naturally from these two notions.
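The two notions above can be checked by hand on a small finite probability space. The following is only an illustrative sketch: the six-point space, its weights, and the helper names `measure`, `expected_abs_difference` and `indicator` are assumptions made for this example and are not part of the report. On a finite space the integral is a weighted sum, and for indicator functions the expected absolute difference reduces to the measure of the symmetric difference.

```python
from fractions import Fraction


def measure(weights, subset):
    """mu(subset) for a probability measure given by point weights."""
    return sum(weights[x] for x in subset)


def expected_abs_difference(weights, f, g):
    """E_mu(f, g): the integral of |f - g| with respect to mu (a finite sum here)."""
    return sum(w * abs(f(x) - g(x)) for x, w in weights.items())


def indicator(subset):
    """Indicator (characteristic) function of a subset."""
    return lambda x: 1 if x in subset else 0


# Hypothetical probability space X = {0, ..., 5} with exact rational weights.
weights = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 8),
           3: Fraction(1, 8), 4: Fraction(1, 8), 5: Fraction(1, 8)}
A = {0, 1, 2}
B = {2, 3}

mu_sym_diff = measure(weights, A ^ B)   # A ^ B is the symmetric difference
e_mu = expected_abs_difference(weights, indicator(A), indicator(B))
print(mu_sym_diff, e_mu)
```

Exact rational arithmetic is used so that the two quantities can be compared for equality without rounding concerns.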

Metric spaces 

Definition 2.4. Let $M$ be a nonempty set. A function $d : M \times M \to \mathbb{R}^+$ is a metric
if the following hold for all $m_1, m_2, m_3 \in M$:

1. $d(m_1, m_2) = 0$ if and only if $m_1 = m_2$

2. $d(m_1, m_2) = d(m_2, m_1)$

3. $d(m_1, m_2) \leq d(m_1, m_3) + d(m_3, m_2)$

In this case, the pair $(M, d)$ is called a metric space.

Definition 2.5. Given a metric space $(M, d)$, a metric sub-space of $M$ (which is
a metric space in its own right) is a nonempty subset $M' \subseteq M$ equipped with the
distance $d|_{M'}$, the restriction of $d$ to $M'$.

The structure of a metric space exists in every vector space equipped with a norm. 

Definition 2.6. Suppose $V$ is a vector space over $\mathbb{R}$. A function $\rho : V \to \mathbb{R}^+$ is a
norm on $V$ if for all $v_1, v_2 \in V$ and for all $r \in \mathbb{R}$:

1. $\rho(r v_1) = |r| \rho(v_1)$

2. $\rho(v_1 + v_2) \leq \rho(v_1) + \rho(v_2)$

3. $\rho(v_1) = 0$ if and only if $v_1 = 0$

If $\rho$ is a norm on $V$, then $(V, \rho)$ is called a normed vector space.

Proposition 2.7. Based on Definition 2.6, the function $d : V \times V \to \mathbb{R}^+$ defined by
$d(u, v) = \rho(u - v)$ is a metric on $V$, and $d$ is called the metric induced by the norm
$\rho$ on $V$.

The following subsection provides a few examples of metric spaces which will be
encountered in this report.

Examples of metric spaces

The real numbers $(\mathbb{R}, \rho)$, with the absolute value norm $\rho(r) = |r|$ for $r \in \mathbb{R}$, form a
normed vector space, so $\mathbb{R}$ can be equipped with a metric structure.

Example 2.8. The set $\mathbb{R}$ with distance $d$ defined by $d(r_1, r_2) = |r_1 - r_2|$ for $r_1, r_2 \in \mathbb{R}$
is a metric space.

The unit interval $[0,1]$ is a subset of $\mathbb{R}$, so it is a metric sub-space of $(\mathbb{R}, d)$, and
this space will be used quite often in this report.

Given a probability space $(X, \mathcal{S}, \mu)$, the set $V$ of all bounded measurable functions
from $X$ to $\mathbb{R}$ is a vector space, with point-wise addition and scalar multiplication.
The function $\rho : V \to \mathbb{R}^+$ defined by

$$\rho(f) = \sqrt{\int_X (f(x))^2 \, d\mu(x)}$$

is a norm on $V$ if any two functions $f, g : X \to \mathbb{R}$ which agree on a subset of $X$ with
full measure, $\mu(\{x \in X : f(x) = g(x)\}) = 1$, are identified.^1 The norm $\rho$ is called the
$L_2(\mu)$ norm on $V$ and we normally write $\|f\|_2 = \rho(f)$ for $f \in V$. As a result, $V$ can
be turned into a metric space.

Example 2.9. Following the notations in the paragraph above, $V$ is a metric space
with distance $d$ defined by

$$d(f, g) = \|f - g\|_2 = \sqrt{\int_X (f(x) - g(x))^2 \, d\mu(x)}.$$

Write $[0,1]^X$ for the set of all measurable functions from a probability space
$(X, \mathcal{S}, \mu)$ to $[0,1]$. Then, it is a metric sub-space of $V$ with distance induced by the
$L_2(\mu)$ norm on $V$, restricted of course to $[0,1]^X$.

Given metric spaces $(M_1, d_1), \dots, (M_k, d_k)$, their product $M_1 \times \dots \times M_k$ always
has a metric structure.

^1 This identification can be done using an equivalence relation, so this report will not go into any
details here.

Example 2.10. If $(M_1, d_1), \dots, (M_k, d_k)$ are metric spaces, then their product
$M_1 \times \dots \times M_k$ is a metric space with distance $d^2$ defined by

$$d^2((m_1, \dots, m_k), (m'_1, \dots, m'_k)) = \sqrt{(d_1(m_1, m'_1))^2 + \dots + (d_k(m_k, m'_k))^2}.$$

The distance $d^2$ is normally referred to as the $L_2$ product distance on $M_1 \times \dots \times M_k$.

From Examples 2.8 and 2.10, the set $[0,1]^k$, which denotes the set-theoretic
product $[0,1] \times \dots \times [0,1]$, is then a metric space with distance $d^2$ defined by

$$d^2((r_1, \dots, r_k), (r'_1, \dots, r'_k)) = \sqrt{|r_1 - r'_1|^2 + \dots + |r_k - r'_k|^2}.$$

Also, following Examples 2.9 and 2.10, if $\mathcal{F}_1, \dots, \mathcal{F}_k$ are sets of measurable functions
from a probability space $(X, \mathcal{S}, \mu)$ to the unit interval, then $\mathcal{F}_i \subseteq [0,1]^X$ for each
$i = 1, \dots, k$. Therefore, the product $\mathcal{F}_1 \times \dots \times \mathcal{F}_k$ is a metric space with distance
$d^2$ defined by

$$d^2((f_1, \dots, f_k), (f'_1, \dots, f'_k)) = \sqrt{(\|f_1 - f'_1\|_2)^2 + \dots + (\|f_k - f'_k\|_2)^2}.$$
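As a small illustration of the $L_2$ product distance, the sketch below builds the distance on a product from a list of component metrics and evaluates it on the cube $[0,1]^3$. The function names `abs_distance` and `l2_product_distance` and the sample points are hypothetical and chosen only for this example.

```python
import math


def abs_distance(r1, r2):
    """The metric on the real line induced by the absolute value norm."""
    return abs(r1 - r2)


def l2_product_distance(metrics):
    """Return the L2 product distance on M_1 x ... x M_k built from d_1, ..., d_k."""
    def d(p, q):
        return math.sqrt(sum(d_i(a, b) ** 2 for d_i, a, b in zip(metrics, p, q)))
    return d


# The cube [0, 1]^3 with the absolute-value metric on every factor.
d_cube = l2_product_distance([abs_distance] * 3)
p, q, r = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (0.5, 0.25, 1.0)
print(d_cube(p, q), d_cube(p, r) + d_cube(r, q))
```

Passing the component metrics as a list keeps the sketch usable for products of different metric spaces, not only for copies of the unit interval.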
3 The Probably Approximately Correct Learning Model



Let $(X, \mathcal{S})$ be a measurable space. A concept class $\mathcal{C}$ of $X$ is a subset of $\mathcal{S}$ and an
element $A \in \mathcal{C}$ (a measurable subset of $X$) is called a concept. A function class $\mathcal{F}$ is
a collection of measurable functions from $X$ to the unit interval $[0,1]$. Unless stated
otherwise, from this section onwards, the following notations will be used:

1. $X = (X, \mathcal{S})$: a measurable space

2. $\mu$: a probability measure $\mathcal{S} \to \mathbb{R}^+$

3. $\mathcal{C}$: a concept class and $\mathcal{F}$: a function class

4. $[0,1]^X$: the set of all measurable functions $f : X \to [0,1]$, instead of the
customary notation of all functions from $X$ to $[0,1]$.

This section provides the definitions of learning $\mathcal{C}$ and $\mathcal{F}$ in the Probably
Approximately Correct (PAC) learning model, introduced in 1984 by Valiant.

Concept class PAC learning involves producing a valid hypothesis for every
concept $A \in \mathcal{C}$ by first drawing random points from $X$, forming a training sample,
labeled with whether these points are contained in $A$. In other words, a labeled
sample of $m$ points $x_1, \dots, x_m \in X$ for $A$ consists of these points and the evaluations
$\chi_A(x_1), \dots, \chi_A(x_m)$ of the indicator function $\chi_A : X \to \{0,1\}$, where

$\chi_A(x) = 1$ if and only if $x \in A$.

On the other hand, an unlabeled sample of points does not include these evaluations.
The set of all labeled samples of $m$ points can then be identified with $(X \times \{0,1\})^m$,
and producing a hypothesis for $A$ with a labeled sample is exactly the process of
associating the sample to a concept $H \in \mathcal{C}$ (i.e. this process is a function from the
set of all labeled samples to the concept class).

Here is the precise definition of a concept class being learnable.

Definition 3.1 ([16]). A concept class $\mathcal{C}$ is distribution-free Probably Approximately
Correct learnable if there exists an algorithm^2 $L : \bigcup_{m \in \mathbb{N}} (X \times \{0,1\})^m \to \mathcal{C}$ with the
following property: for every $\epsilon > 0$, for every $\delta > 0$, there exists an $M \in \mathbb{N}$ such
that for every $A \in \mathcal{C}$, for every probability measure $\mu$, for every $m \geq M$, for any
$x_1, \dots, x_m \in X$, we have $\mu(H_m \,\Delta\, A) < \epsilon$ with confidence at least $1 - \delta$, where
$H_m = L((x_1, \chi_A(x_1)), \dots, (x_m, \chi_A(x_m)))$.

^2 In this report, a learning algorithm is simply defined to be a function.

Confidence of at least $1 - \delta$ in the definition above, keeping to the same notations,
simply means that the (product) measure of the set of all $m$-tuples $(x_1, \dots, x_m) \in X^m$,
where $\mu(H_m \,\Delta\, A) < \epsilon$ for $H_m = L((x_1, \chi_A(x_1)), \dots, (x_m, \chi_A(x_m)))$, is at least $1 - \delta$.
In other words, an equivalent statement to "$\mathcal{C}$ is distribution-free PAC learnable" is
that for every $\epsilon, \delta > 0$, there exists $M \in \mathbb{N}$ such that for every $A \in \mathcal{C}$, probability
measure $\mu$, and $m \geq M$,

$$\mu^m(\{(x_1, \dots, x_m) \in X^m : \mu(H_m \,\Delta\, A) \geq \epsilon\}) \leq \delta,$$ ^3

for $H_m = L((x_1, \chi_A(x_1)), \dots, (x_m, \chi_A(x_m)))$.

^3 The symbol $\mu^m$ denotes the product measure on $X^m$; the reader can refer to [6] for the details.

Informally, a concept class $\mathcal{C}$ is distribution-free learnable in the PAC learning model if
an algorithm $L$ can always construct a hypothesis $H$ for every concept
$A \in \mathcal{C}$, using any labeled sample for $A$, such that the measure of their symmetric
difference $H \,\Delta\, A$ is arbitrarily small with respect to every probability measure and
with arbitrarily high confidence, as long as the sample size is large enough.

Every concept $A \in \mathcal{C}$ is a subset of $X$, so $A$ can be associated to its indicator
function $\chi_A : X \to \{0,1\}$. Even more generally, $\chi_A$ is a function from $X$ to $[0,1]$; in
other words, every concept class $\mathcal{C}$ can be identified with a function class
$\mathcal{F}_\mathcal{C} = \{\chi_A : X \to [0,1] : A \in \mathcal{C}\}$, so it is natural to generalize Definition 3.1 to any function class
$\mathcal{F}$.

Definition 3.1 involves the symmetric difference of two concepts, and its generalization
to measurable functions $f, g : X \to [0,1]$ is the expected value of their absolute
difference $E_\mu(f, g)$, as seen in the previous section:

$$E_\mu(f, g) = \int_X |f(x) - g(x)| \, d\mu(x).$$

A simple exercise can show that if $f, g \in [0,1]^X$ take values in $\{0,1\}$, so they are
indicator functions of two concepts $A, B \subseteq X$, then $E_\mu(f, g)$ coincides with the measure
of their symmetric difference: $E_\mu(f, g) = \mu(A \,\Delta\, B)$, where $f = \chi_A$ and $g = \chi_B$.

With the generalization of the symmetric difference, distribution-free PAC learning
for any function class can be defined. In the context of function class learning,
a labeled sample of $m$ points $x_1, \dots, x_m \in X$ for a function $f \in \mathcal{F}$ consists of these
points and the evaluations $f(x_1), \dots, f(x_m)$. Then, the set of all labeled samples
of $m$ points can be identified with $(X \times [0,1])^m$, and producing a hypothesis is the
process of associating a labeled sample to a function $H \in \mathcal{F}$ (just as in concept class
learning).

Definition 3.2 ([H]). A function class $\mathcal{F}$ is distribution-free Probably Approximately
Correct learnable if there exists an algorithm $L : \bigcup_{m \in \mathbb{N}} (X \times [0,1])^m \to \mathcal{F}$
with the following property: for every $\epsilon > 0$, for every $\delta > 0$, there exists an $M \in \mathbb{N}$
such that for every $f \in \mathcal{F}$, for every probability measure $\mu$, for every $m \geq M$, for
any $x_1, \dots, x_m \in X$, we have $E_\mu(H_m, f) < \epsilon$ with confidence at least $1 - \delta$, where

$H_m = L((x_1, f(x_1)), \dots, (x_m, f(x_m)))$.

Both definitions of PAC learning contain the $\epsilon$ and $\delta$ parameters. The error
parameter $\epsilon$ is used because the hypothesis is not required to have zero error, only
an arbitrarily small error. The risk parameter $\delta$ exists because there is no guarantee
that any collection of sufficiently many training points leads to a valid hypothesis; the
learning algorithm is only expected to produce a valid hypothesis with the sample
points with confidence at least $1 - \delta$. Hence, the name "Probably ($\delta$) Approximately
($\epsilon$) Correct" is used [8].

The following example illustrates that the set of all axis-aligned rectangles in $\mathbb{R}^2$
is distribution-free PAC learnable. Both the statement and its proof can be found in
Chapter 3 of [18] and Chapter 1 of [8].

Example 3.3. In $X = \mathbb{R}^2$, the concept class $\mathcal{C} = \{[a,b] \times [c,d] : a, b, c, d \in \mathbb{R}\}$ is
distribution-free PAC learnable.

Proof. Let $\epsilon, \delta > 0$. Given a concept $A \in \mathcal{C}$ and any sample of $m$ training points
$x_1, \dots, x_m \in X$, define the hypothesis concept $H_m$ to be the intersection of all
rectangles containing every training point $x_i$ such that $\chi_A(x_i) = 1$. In other words, $H_m$
is the smallest rectangle that contains all the sample points lying in $A$.

Let $\mu$ be any probability measure. Since $H_m \subseteq A$, we have $H_m \,\Delta\, A = A \setminus H_m$, which can be
broken down into four sections $T_1, \dots, T_4$. If we can conclude that

$$\mu(H_m \,\Delta\, A) = \mu\Big(\bigcup_{i=1}^{4} T_i\Big) < \epsilon,$$

with confidence at least $1 - \delta$, then the proof is complete.

Consider the top section $T_1$ and define $T'_1$ to be the rectangle along the top side
of $A$ whose measure is exactly $\epsilon/4$. The event $T'_1 \subseteq T_1$, which is equivalent to
$\mu(T_1) \geq \epsilon/4$, holds exactly when no points in the sample $x_1, \dots, x_m$ fall in $T'_1$, and the
probability of this event (which is the measure of all $m$-tuples $(x_1, \dots, x_m) \in X^m$
where $x_i \notin T'_1$ for all $i = 1, \dots, m$) is

$$(1 - \epsilon/4)^m.$$

Similarly, the same holds for the other three sections $T_2, \dots, T_4$. Therefore, the
probability that there exists at least one $T_i$ such that $\mu(T_i) \geq \epsilon/4$, where $i \in \{1, \dots, 4\}$,
is at most

$$4(1 - \epsilon/4)^m.$$

Hence, as long as we pick $m$ large enough that $4(1 - \epsilon/4)^m \leq \delta$, with confidence
(probability) at least $1 - \delta$, $\mu(T_i) < \epsilon/4$ for every $i = 1, \dots, 4$ and thus,

$$\mu(H_m \,\Delta\, A) = \mu\Big(\bigcup_{i=1}^{4} T_i\Big) \leq \mu(T_1) + \dots + \mu(T_4) < 4\Big(\frac{\epsilon}{4}\Big) = \epsilon.$$

Please note that this argument, though very intuitive, actually requires the classical
Glivenko-Cantelli theorem.

In summary, since $(1 - \epsilon/4)^m \leq e^{-m\epsilon/4}$, as long as $m \geq (4/\epsilon) \ln(4/\delta)$, with confidence
at least $1 - \delta$, $\mu(H_m \,\Delta\, A) < \epsilon$. We note that this estimate of the sample size only
depends on $\epsilon$ and $\delta$, so $\mathcal{C}$ is indeed distribution-free PAC learnable. □
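The hypothesis used in the proof, the smallest rectangle around the positively labeled points, is easy to write down. The sketch below is an illustration under stated assumptions: the target rectangle, the uniform distribution on the unit square, the random seed and all function names are hypothetical choices made for this example, and the uniform distribution is just one of the many measures the distribution-free statement covers.

```python
import math
import random


def sample_size(eps, delta):
    """Smallest integer m with m >= (4 / eps) * ln(4 / delta)."""
    return math.ceil((4.0 / eps) * math.log(4.0 / delta))


def tightest_rectangle(labeled_sample):
    """Smallest axis-aligned rectangle containing every positively labeled point.

    Returns (a, b, c, d), meaning [a, b] x [c, d], or None if no point is positive.
    """
    positives = [p for p, label in labeled_sample if label == 1]
    if not positives:
        return None
    xs = [p[0] for p in positives]
    ys = [p[1] for p in positives]
    return (min(xs), max(xs), min(ys), max(ys))


def contains(rect, point):
    """Membership test for a rectangle; None stands for the empty hypothesis."""
    if rect is None:
        return False
    a, b, c, d = rect
    return a <= point[0] <= b and c <= point[1] <= d


# Hypothetical target concept A and sampling distribution (uniform on [0, 1]^2).
target = (0.2, 0.7, 0.3, 0.9)
rng = random.Random(0)
m = sample_size(eps=0.1, delta=0.05)
points = [(rng.random(), rng.random()) for _ in range(m)]
sample = [(p, 1 if contains(target, p) else 0) for p in points]
hypothesis = tightest_rectangle(sample)
print(m, hypothesis)
```

Because the hypothesis is built only from positive points, it always lies inside the target, which is the inclusion $H_m \subseteq A$ used in the proof.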

In the next section, a fundamental theorem which characterizes concept class
distribution-free PAC learning will be stated, and two more concept classes, one
learnable and the other not,^4 will be given. However, in order to state this theorem,
the notion of shattering, which is essential in learning theory, must be introduced.

^4 They are direct results of the theorem.

4 The Vapnik-Chervonenkis Dimension



The Vapnik-Chervonenkis dimension is a combinatorial parameter which is defined 
using the notion of shattering, developed first in 1971 by Vapnik and Chervonenkis. 

Definition 4.1 ([H]). Given any set $X$ and a collection $\mathcal{A}$ of subsets of $X$, the
collection $\mathcal{A}$ shatters a subset $S \subseteq X$ if for every $B \subseteq S$, there exists $A \in \mathcal{A}$ such
that

$$A \cap S = B.$$

There is an equivalent condition to shattering, which is sometimes easier to work
with, expressed in terms of characteristic functions of subsets of $X$.

Proposition 4.2. The collection $\mathcal{A}$ shatters a subset $S = \{x_1, \dots, x_n\} \subseteq X$ if and
only if for every $e = (e_1, \dots, e_n) \in \{0,1\}^n$, there exists $A \in \mathcal{A}$ such that

$$\chi_A(x_i) = e_i,$$

for all $i = 1, \dots, n$.

Proof. Trivial. □

Definition 4.3 ([17]). The Vapnik-Chervonenkis (VC) dimension of the collection
$\mathcal{A}$, denoted by $\mathrm{VC}(\mathcal{A})$, is defined to be the cardinality of the largest finite subset
$S \subseteq X$ shattered by $\mathcal{A}$. If $\mathcal{A}$ shatters arbitrarily large finite subsets of $X$, then the
VC dimension of $\mathcal{A}$ is defined to be $\infty$.

The VC dimension is defined for every collection $\mathcal{A}$ of subsets of any set $X$, so in
particular, $X = (X, \mathcal{S})$ can be a measurable space and $\mathcal{A} = \mathcal{C}$ can be a concept class.

The following are a few examples of how to calculate VC dimensions in the context
of $X = \mathbb{R}^n$. In order to prove that the VC dimension of a concept class $\mathcal{C}$ is $d$, we must
provide a subset $S \subseteq X$ with cardinality $d$ which is shattered by $\mathcal{C}$ and prove that no
subset with cardinality $d + 1$ can be shattered by $\mathcal{C}$.

Example 4.4. If $X = \mathbb{R}$, then the powerset of $X$ has infinite VC dimension. More
generally, for every infinite set $X$, $\mathrm{VC}(\mathcal{P}(X)) = \infty$.

Example 4.5. In the space $X = \mathbb{R}$, let $\mathcal{C} = \{[a,b] : a, b \in \mathbb{R}, a < b\}$ be the collection
of all closed intervals. Then, $\mathrm{VC}(\mathcal{C}) = 2$.

Proof. Consider the subset $S = \{1, 2\} \subseteq \mathbb{R}$; $\mathcal{C}$ shatters $S$ because

$$[a,b] \cap S = \begin{cases} \emptyset & \text{if } a > 2 \text{ or } b < 1 \\ \{1\} & \text{if } a \leq 1 \leq b < 2 \\ \{2\} & \text{if } 1 < a \leq 2 \leq b \\ \{1,2\} & \text{if } a \leq 1,\ b \geq 2. \end{cases}$$

On the other hand, consider any subset $S = \{x, y, z\} \subseteq \mathbb{R}$ with three distinct points,
and assume the order to be $x < y < z$. Then, there is no closed interval in $\mathcal{C}$
containing $x$ and $z$ but not $y$. □
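The two halves of this argument (exhibit a shattered set of size $d$, rule out every set of size $d+1$) can be checked by brute force on a finite set of points. The sketch below is illustrative only; the function names are hypothetical. It relies on the observation that the trace of a closed interval on a finite set is either empty or a run of consecutive points, so it suffices to try endpoints taken from the set itself plus one extra point to its right. Here degenerate intervals $[a,a]$ are allowed for convenience, which does not change the traces obtained on distinct integer points.

```python
from itertools import combinations


def interval_traces(S):
    """All sets [a, b] ∩ S obtained from closed intervals [a, b] with a <= b."""
    pts = sorted(S)
    endpoints = pts + [pts[-1] + 1]   # the extra endpoint realizes the empty trace
    return {frozenset(x for x in pts if a <= x <= b)
            for a in endpoints for b in endpoints if a <= b}


def is_shattered(S, traces):
    """S is shattered exactly when all 2^|S| subsets occur as traces."""
    return len(traces(S)) == 2 ** len(S)


def vc_dimension_on(domain, traces):
    """Largest size of a shattered subset of a finite domain (brute force)."""
    best = 0
    for n in range(1, len(domain) + 1):
        if any(is_shattered(S, traces) for S in combinations(domain, n)):
            best = n
        else:
            break   # subsets of shattered sets are shattered, so we can stop
    return best


print(is_shattered((1, 2), interval_traces),
      is_shattered((1, 2, 3), interval_traces),
      vc_dimension_on(list(range(1, 7)), interval_traces))
```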



Example 4.6. Consider the space $X = \mathbb{R}^n$. A hyperplane $H_{\vec{a},b}$ is defined by a nonzero
vector $\vec{a} = (a_1, \dots, a_n) \in \mathbb{R}^n$ and a scalar $b \in \mathbb{R}$:

$$H_{\vec{a},b} = \{x = (x_1, \dots, x_n) \in \mathbb{R}^n : x \cdot \vec{a} = b\} = \{x = (x_1, \dots, x_n) \in \mathbb{R}^n : x_1 a_1 + \dots + x_n a_n = b\}.$$

Write $\mathcal{C}$ as the set of all hyperplanes: $\mathcal{C} = \{H_{\vec{a},b} : \vec{a} \in \mathbb{R}^n \setminus \{0\}, b \in \mathbb{R}\}$. Then
$\mathrm{VC}(\mathcal{C}) = n$.

Proof. Consider the subset $S = \{e_1, \dots, e_n\} \subseteq \mathbb{R}^n$, where $e_i$ is the vector with 1 in
the $i$-th component and 0 everywhere else. Suppose $B \subseteq S$; there are two cases
to consider:

1. If $B = \emptyset$, then let $\vec{a} = (1, 1, \dots, 1) \in \mathbb{R}^n$; the hyperplane
$H_{\vec{a},-1} = \{x = (x_1, \dots, x_n) \in \mathbb{R}^n : x_1 + \dots + x_n = -1\}$ is disjoint from $S$.

2. If $B \neq \emptyset$, then set $\vec{a} = (a_1, \dots, a_n) \in \mathbb{R}^n \setminus \{0\}$, where $a_i = \chi_B(e_i)$. Then the
hyperplane $H_{\vec{a},1} = \{x = (x_1, \dots, x_n) \in \mathbb{R}^n : x_1 a_1 + \dots + x_n a_n = 1\}$ satisfies

$$H_{\vec{a},1} \cap S = B.$$

Moreover, no subset $S = \{x_1, \dots, x_n, x_{n+1}\} \subseteq \mathbb{R}^n$ with cardinality $n + 1$ can be
shattered by $\mathcal{C}$. At best, there exists a unique hyperplane $H_{\vec{a},b}$ containing $n$ of these
points, say $\{x_1, \dots, x_n\}$, so if $x_{n+1} \in H_{\vec{a},b}$, then there are no hyperplanes that include
$x_1, \dots, x_n$, but not $x_{n+1}$. Otherwise, if $x_{n+1} \notin H_{\vec{a},b}$, then there are no hyperplanes
that include $x_1, \dots, x_n, x_{n+1}$. □
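The first half of the proof is constructive, so it can be replayed mechanically: for each subset $B$ of the standard basis, build the hyperplane named in the two cases and check which basis vectors lie on it. The sketch below does this for small $n$; the function names are hypothetical, and it verifies only the shattering of the standard basis, not the upper bound.

```python
from itertools import product


def on_hyperplane(a, b, x):
    """True when x lies on the hyperplane {x : x . a = b}."""
    return sum(ai * xi for ai, xi in zip(a, x)) == b


def basis(n):
    """The standard basis e_1, ..., e_n of R^n as integer tuples."""
    return [tuple(1 if j == i else 0 for j in range(n)) for i in range(n)]


def hyperplane_for(B, S):
    """The pair (a, b) used in the proof for a subset B of S = {e_1, ..., e_n}."""
    n = len(S)
    if not B:
        return (1,) * n, -1                       # case 1: misses every e_i
    return tuple(1 if e in B else 0 for e in S), 1  # case 2: a_i = chi_B(e_i)


def shatters_basis(n):
    """Check that the hyperplanes from the proof cut out every subset of the basis."""
    S = basis(n)
    for mask in product([0, 1], repeat=n):
        B = {e for e, keep in zip(S, mask) if keep}
        a, b = hyperplane_for(B, S)
        if {e for e in S if on_hyperplane(a, b, e)} != B:
            return False
    return True


print([shatters_basis(n) for n in range(1, 6)])
```

Integer coordinates are used so that the test `x . a = b` is exact.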

The first example is trivial and the second is fairly well-known, seen in [8] and
[TU], but we believe the third, Example 4.6, is a new result.

A very important concept related to shattering is the growth of the number of possible
subsets $A \cap S$, for $A \in \mathcal{C}$, as $S \subseteq X$ increases in size. It is clear that this growth
is always exponential if $\mathcal{C}$ has infinite VC dimension; Sauer's Lemma explains the
growth when $\mathrm{VC}(\mathcal{C}) < \infty$.

4.1 Sauer's Lemma 

Given a concept class $\mathcal{C}$ of $X$, another way to express that $\mathcal{C}$ shatters a subset $S \subseteq X$,
with cardinality $n$, is to consider the set of all $A \cap S$, where $A \in \mathcal{C}$. Following Chapter
4 of [18], $\mathcal{C}$ shatters $S$ if and only if

$$|\{A \cap S : A \in \mathcal{C}\}| = 2^n.$$

More generally, for any subset $S \subseteq X$, define

$$\pi(S; \mathcal{C}) = |\{A \cap S : A \in \mathcal{C}\}|$$

and

$$\pi(n; \mathcal{C}) = \max_{|S| = n} \pi(S; \mathcal{C}).$$

Then, the VC dimension of $\mathcal{C}$ can now be expressed in terms of the growth of $\pi(n; \mathcal{C})$
as $n$ gets large.

Proposition 4.7. Given a concept class $\mathcal{C}$, the following conditions are equivalent:

1. $\mathrm{VC}(\mathcal{C}) \geq n$;

2. $\mathcal{C}$ shatters some subset $S \subseteq X$ with cardinality $n$;

3. $\pi(n; \mathcal{C}) = 2^n$.

Moreover, the class $\mathcal{C}$ has infinite VC dimension if and only if $\pi(n; \mathcal{C}) = 2^n$ for
all $n \in \mathbb{N}$. Conversely, $\mathcal{C}$ has finite VC dimension, say $\mathrm{VC}(\mathcal{C}) \leq d$, if and only if
$\pi(n; \mathcal{C}) < 2^n$ for all $n > d$.

Proof. The proof follows from the fact that $\mathcal{C}$ shatters $S$ if and only if
$\pi(S; \mathcal{C}) = 2^{|S|}$. □

The extremely interesting fact, as seen in the next theorem, is that if $\mathcal{C}$ has finite
VC dimension $d$, then $\pi(n; \mathcal{C})$ is bounded by a polynomial in $n$ of degree $d$, for
$n \geq d$. This result, called Sauer's Lemma, was first proven in 1972 by Sauer. In
other words, as $n$ gets large, $\pi(n; \mathcal{C})$ is either always an exponential function with
base 2 or eventually bounded by a polynomial function of a fixed degree.

Theorem 4.8 (Sauer's Lemma [12]). Suppose a concept class $\mathcal{C}$ has finite VC
dimension $d$. Then

$$\pi(n; \mathcal{C}) \leq \sum_{i=0}^{d} \binom{n}{i} \leq \Big(\frac{en}{d}\Big)^d$$

for all $n \geq d \geq 1$, where $e$ denotes the base of the natural logarithm.

Of course, everything in this subsection, including Sauer's Lemma, is true for
any collection of subsets of any set, but in the context of statistical learning theory,
Sauer's Lemma is particularly useful because it is used to prove the equivalence of a
concept class having finite VC dimension and the class being distribution-free PAC
learnable.
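To see the polynomial growth concretely, one can count traces for the class of closed intervals on the real line, whose VC dimension is 2 by Example 4.5. The sketch below is illustrative; the function names are hypothetical, and the two bounds it compares against, the binomial sum and $(en/d)^d$, are the standard forms in which Sauer's Lemma is usually stated. For intervals the trace count on $n$ distinct points depends only on $n$, so it equals $\pi(n;\mathcal{C})$.

```python
import math


def interval_trace_count(n):
    """pi(n; C) for closed intervals on n distinct real points.

    A trace is either empty or a run of consecutive points, so we enumerate
    the runs and add one for the empty trace.
    """
    pts = list(range(n))
    runs = {tuple(pts[i:j]) for i in range(n) for j in range(i + 1, n + 1)}
    return len(runs) + 1


def sauer_sum(n, d):
    """The binomial-sum form of the bound: sum of C(n, i) for i = 0, ..., d."""
    return sum(math.comb(n, i) for i in range(d + 1))


def sauer_power(n, d):
    """The closed-form bound (e * n / d)^d, meaningful for n >= d >= 1."""
    return (math.e * n / d) ** d


for n in range(2, 8):
    print(n, interval_trace_count(n), sauer_sum(n, 2),
          round(sauer_power(n, 2), 2), 2 ** n)
```

For this particular class the binomial sum is attained exactly, while $2^n$ quickly outgrows both bounds.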
4.2 Characterization of concept class distribution-free PAC learning

The following is one of the main theorems concerning PAC learning, whose proof
results from Vapnik and Chervonenkis' paper [17] in 1971 and the 1989 paper [5] by
Blumer et al.

Theorem 4.9 ([17] and [5]). Let $\mathcal{C}$ be a concept class of a measurable space $(X, \mathcal{S})$.
The following are equivalent:

1. $\mathcal{C}$ is distribution-free Probably Approximately Correct learnable.

2. $\mathrm{VC}(\mathcal{C}) < \infty$.

Both directions of the proof require expressing the number of sample training
points required for learning in terms of the VC dimension of $\mathcal{C}$; Sauer's Lemma is
used to provide a sufficient number of points required for learning in the direction
2) $\Rightarrow$ 1).

Using Theorem 4.9, one can more easily determine whether a given concept class
is distribution-free PAC learnable.

Example 4.10. Let $X$ be any infinite set. Then the powerset $\mathcal{P}(X)$ is not
distribution-free PAC learnable.

Example 4.11. The set of all hyperplanes $\mathcal{C} = \{H_{\vec{a},b} : \vec{a} \in \mathbb{R}^n \setminus \{0\}, b \in \mathbb{R}\}$, as
defined in Example 4.6, is distribution-free PAC learnable.

Both examples come directly from the calculations of their concept classes' VC
dimensions in Examples 4.4 and 4.6 and from Theorem 4.9.

Every concept class $\mathcal{C}$ can be viewed as a function class $\mathcal{F}_\mathcal{C} = \{\chi_A : X \to [0,1] : A \in \mathcal{C}\}$,
as seen in Section 3, so a natural question is whether the notion of shattering
can be generalized. Indeed, the next section introduces the Fat Shattering dimension
of scale $\epsilon$, which is a generalization of the VC dimension.

5 The Fat Shattering Dimension



Let $\epsilon > 0$ from this section onwards. A combinatorial parameter which generalizes the
Vapnik-Chervonenkis dimension is the Fat Shattering dimension of scale $\epsilon$, defined
first by Kearns and Schapire in 1994.

This dimension, assigned to function classes, involves the notion of $\epsilon$-shattering.
Similar to the notion of (regular) shattering, it can be defined for any collection
of functions $f : X \to [0,1]$, where $X$ is any set, but for the sake of this report, the
following sections (still) assume $X = (X, \mathcal{S})$ is a measurable space and the collection
of functions is a function class $\mathcal{F}$.

Definition 5.1 ([7]). Let $\mathcal{F}$ be a function class. Given a subset $S = \{x_1, \dots, x_n\} \subseteq X$,
the class $\mathcal{F}$ $\epsilon$-shatters $S$, with witness $c = (c_1, \dots, c_n) \in [0,1]^n$, if for every
$e \in \{0,1\}^n$, there exists $f \in \mathcal{F}$ such that

$$f(x_i) \geq c_i + \epsilon \text{ for } e_i = 1, \text{ and } f(x_i) \leq c_i - \epsilon \text{ for } e_i = 0.$$

Definition 5.2 ([7]). The Fat Shattering dimension of scale $\epsilon > 0$ of $\mathcal{F}$, denoted
by $\mathrm{fat}_\epsilon(\mathcal{F})$, is defined to be the cardinality of the largest finite subset of $X$ that can
be $\epsilon$-shattered by $\mathcal{F}$. If $\mathcal{F}$ can $\epsilon$-shatter arbitrarily large finite subsets, then the Fat
Shattering dimension of scale $\epsilon$ of $\mathcal{F}$ is defined to be $\infty$.

When the function class $\mathcal{F}$ consists of only functions taking values in $\{0,1\}$, then
the Fat Shattering dimension of any scale $\epsilon \leq 1/2$ of $\mathcal{F}$ agrees with the VC dimension
of the corresponding collection of subsets of $X$, induced by the (indicator) functions
in $\mathcal{F}$.

Proposition 5.3. Suppose a function class $\mathcal{F}$ consists of only binary functions
$f : X \to \{0,1\}$. For every $f \in \mathcal{F}$, there exists a unique subset $A_f \subseteq X$ such that
$\chi_{A_f} = f$. Moreover, writing $\mathcal{C} = \{A_f : f \in \mathcal{F}\}$, we have $\mathrm{VC}(\mathcal{C}) = \mathrm{fat}_\epsilon(\mathcal{F})$ for all $\epsilon \leq 0.5$.

Proof. The first statement, of the existence of a unique subset $A_f \subseteq X$ for every
binary function $f$, is clear. Let $\epsilon \leq 0.5$. To show that $\mathrm{VC}(\mathcal{C}) = \mathrm{fat}_\epsilon(\mathcal{F})$, it suffices to
prove that $\mathcal{C}$ shatters $S = \{x_1, \dots, x_n\}$ if and only if $\mathcal{F}$ $\epsilon$-shatters $S$.

The equivalent condition to shattering as seen in Proposition 4.2 will be used.
Suppose $\mathcal{C}$ shatters $S$ and define $c = (0.5, 0.5, \dots, 0.5) \in [0,1]^n$. For every $e \in \{0,1\}^n$,
there exists $A_f \in \mathcal{C}$, where $f \in \mathcal{F}$, such that

$$\chi_{A_f}(x_i) = e_i$$

for all $i = 1, \dots, n$ and thus,

$$f(x_i) = \chi_{A_f}(x_i) = e_i \geq 0.5 + \epsilon \text{ for } e_i = 1$$

and

$$f(x_i) = \chi_{A_f}(x_i) = e_i \leq 0.5 - \epsilon \text{ for } e_i = 0.$$

Conversely, suppose $\mathcal{F}$ $\epsilon$-shatters $S$ with witness $c = (c_1, \dots, c_n) \in [0,1]^n$. Let
$e \in \{0,1\}^n$; there exists $f \in \mathcal{F}$ such that

$$f(x_i) \geq c_i + \epsilon \text{ for } e_i = 1, \text{ and } f(x_i) \leq c_i - \epsilon \text{ for } e_i = 0,$$

but $f$ is binary and $\epsilon$ is strictly positive, so $f(x_i) \geq c_i + \epsilon$ implies $f(x_i) = 1$ for $e_i = 1$
and $f(x_i) \leq c_i - \epsilon$ implies $f(x_i) = 0$ for $e_i = 0$. As a result, $A_f \in \mathcal{C}$ satisfies

$$\chi_{A_f}(x_i) = f(x_i) = e_i$$

for all $i = 1, \dots, n$. Therefore, $\mathrm{VC}(\mathcal{C}) = \mathrm{fat}_\epsilon(\mathcal{F})$. □
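For a finite class and a fixed witness, $\epsilon$-shattering can be tested directly from the definition. The sketch below is illustrative: the function names and the small concept class are hypothetical, and it assumes the non-strict inequalities $f(x_i) \geq c_i + \epsilon$ and $f(x_i) \leq c_i - \epsilon$. It compares ordinary shattering by a concept class with $\epsilon$-shattering by the corresponding indicator functions at witness $(0.5, \dots, 0.5)$, in the spirit of Proposition 5.3.

```python
from itertools import product


def eps_shatters(functions, S, eps, witness):
    """True if the class eps-shatters S with the given witness c.

    functions: iterable of callables X -> [0, 1]; S: list of points;
    witness: list with one threshold c_i per point of S.
    """
    functions = list(functions)
    for e in product([0, 1], repeat=len(S)):
        realized = any(
            all((f(x) >= c + eps) if bit == 1 else (f(x) <= c - eps)
                for x, c, bit in zip(S, witness, e))
            for f in functions
        )
        if not realized:
            return False
    return True


def shatters(concepts, S):
    """Ordinary shattering of S by a collection of subsets."""
    return len({frozenset(A & set(S)) for A in concepts}) == 2 ** len(S)


# Hypothetical concept class on X = {0, 1, 2}: all subsets of {0, 1}.
concepts = [set(), {0}, {1}, {0, 1}]
indicators = [lambda x, A=A: 1 if x in A else 0 for A in concepts]

print(shatters(concepts, [0, 1]),
      eps_shatters(indicators, [0, 1], 0.25, [0.5, 0.5]))
```

Note that `eps_shatters` checks one given witness only; a negative answer for that witness does not by itself rule out other witnesses for a general function class.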

Here is an example of a commonly used function class which we proved,
independently of any sources, to have infinite Fat Shattering dimension of scale $\epsilon$.

Example 5.4. Let $X = \mathbb{R}^+$ and let $\mathcal{F}$ be the set of all continuous functions
$f : X \to [0,1]$. Then $\mathrm{fat}_\epsilon(\mathcal{F}) = \infty$ for all $0 < \epsilon \leq 0.5$.

Proof. Suppose $0 < \epsilon \leq 0.5$, and consider a collection of continuous $[0,1]$-valued
functions defined as follows. Given $e \in \{0,1\}^{\mathbb{N}}$, a countable binary sequence, define
$f_e : X \to [0,1]$ by

$$f_e(x) = \begin{cases} 1 & \text{if } e_i = 1 \\ 0 & \text{if } e_i = 0, \end{cases}$$

if $x = i \in \mathbb{N}$. Otherwise, for $x \in [m, m+1]$, with $m \in \mathbb{N}$,

$$f_e(x) = \begin{cases} -(x - m) + 1 & \text{if } e_m = 1,\ e_{m+1} = 0 \\ (x - m) & \text{if } e_m = 0,\ e_{m+1} = 1 \\ e_m & \text{if } e_m = e_{m+1}. \end{cases}$$

For each $e \in \{0,1\}^{\mathbb{N}}$, $f_e$ is continuous because it is a piecewise linear function
whose pieces agree on the overlaps. Write $F = \{f_e : e \in \{0,1\}^{\mathbb{N}}\}$, so that $F \subseteq \mathcal{F}$. To
show that $\mathrm{fat}_\epsilon(\mathcal{F}) = \infty$, it suffices to prove that $\mathrm{fat}_\epsilon(F) = \infty$. Consider the subset
$S = \{1, \dots, n\} \subseteq X$ for any $n \in \mathbb{N}$; the collection $F$ $\epsilon$-shatters $S$ with witness
$c = (0.5, 0.5, \dots, 0.5) \in [0,1]^n$: each $e \in \{0,1\}^n$ can be extended to a countable
binary sequence $e'$, where $e'_i = e_i$ for all $i = 1, \dots, n$ and $e'_i = 0$ otherwise. Then, it is
clear that

$$f_{e'}(x_i) = 1 \geq c_i + \epsilon \text{ for } e_i = 1, \text{ and } f_{e'}(x_i) = 0 \leq c_i - \epsilon \text{ for } e_i = 0,$$

with $x_i = i \in S$ for $i = 1, \dots, n$. □
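The functions $f_e$ are simply linear interpolations between the prescribed values at consecutive integers, which is easy to express in code. The sketch below is illustrative and makes two labeled assumptions: the binary sequence is given as a finite list extended by zeros, and the value at index 0 is taken to be 0 so that the function is also defined on $[0,1]$. The name `make_f` is hypothetical.

```python
import math


def make_f(e):
    """f_e for a finite 0/1 list e = [e_1, ..., e_n], extended by zeros.

    Assumption: indices start at 1 and the value at index 0 is taken to be 0.
    """
    def bit(i):
        return e[i - 1] if 1 <= i <= len(e) else 0

    def f(x):
        m = math.floor(x)
        left, right = bit(m), bit(m + 1)
        # Linear interpolation between the values at m and m + 1; this covers
        # the three cases of the definition (falling, rising, constant).
        return left + (right - left) * (x - m)

    return f


f = make_f([1, 0, 1, 1])
print([f(i) for i in range(1, 5)], f(1.5), f(2.25), f(3.5))
```

At each integer $i$ the interpolation returns exactly $e_i$, which is all the shattering argument uses.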

With the generalization from a concept class to a function class, a natural question
is whether the finiteness of the Fat Shattering dimension of all scales $\epsilon$ for a function
class $\mathcal{F}$ is equivalent to $\mathcal{F}$ being distribution-free PAC learnable. This question is
addressed in the following subsection.

5.1 Sufficient condition for function class distribution-free PAC learning

One direction of Theorem 4.9 can be generalized and stated in terms of the Fat
Shattering dimension of scale $\epsilon$ of a function class.

Theorem 5.5 ([1] and [IH]). Let $\mathcal{F}$ be a function class. If $\mathrm{fat}_\epsilon(\mathcal{F}) < \infty$ for all $\epsilon > 0$,
then $\mathcal{F}$ is distribution-free PAC learnable.

However, the converse to Theorem 5.5 is false. There exists a distribution-free
PAC learnable function class with infinite Fat Shattering dimension of some scale $\epsilon$.

In fact, for every concept class $\mathcal{C}$ with cardinality $\aleph_0$ or $2^{\aleph_0}$, there is an associated
function class $\mathcal{F}_\mathcal{C}$ defined as follows. Set up a bijection $b : \mathcal{C} \to [0, 1/3]$ or to
$[0, 1/3] \cap \mathbb{Q}$, depending on the cardinality of $\mathcal{C}$, and for every $A \in \mathcal{C}$, define a function
$f_A : X \to [0,1]$ by

$$f_A(x) = \chi_A(x) + (-1)^{\chi_A(x)} b(A).$$

Now, write $\mathcal{F}_\mathcal{C} = \{f_A : A \in \mathcal{C}\}$. Note that $\mathcal{F}_\mathcal{C}$ can be thought of as the collection of
all indicator functions of $A \in \mathcal{C}$, except that each "indicator" function $f_A$ has two
unique identifying values $b(A)$ and $1 - b(A)$, instead of simply 0 and 1. The following
proposition provides many counterexamples to the converse of Theorem 5.5, which are
much simpler than the one found in [18].

The construction of the function class $\mathcal{F}_\mathcal{C}$ and the proposition below are developed
from an idea of Example 2.10 in [11].

Proposition 5.6. Let $\mathcal{C}$ be a concept class. The associated function class
$\mathcal{F}_\mathcal{C} = \{f_A : A \in \mathcal{C}\}$, defined in the previous paragraph, is always distribution-free PAC learnable;
this class has infinite Fat Shattering dimension of all scales $\epsilon \leq 1/6$ if $\mathcal{C}$ has infinite
VC dimension.

Proof. The function class $\mathcal{F}_{\mathcal{C}}$ is distribution-free PAC learnable because every function
$f_A \in \mathcal{F}_{\mathcal{C}}$ can be uniquely identified with just one point $x_0 \in X$ in any labeled
sample: $f_A(x_0) \in \{b(A), 1 - b(A)\}$ uniquely determines $A$ and thus, $f_A$.

Furthermore, suppose $\mathcal{C}$ has infinite VC dimension. Let $n \in \mathbb{N}$ be arbitrary.
Because $\mathrm{VC}(\mathcal{C}) = \infty$, there exists $S = \{x_1, \ldots, x_n\}$ such that $\mathcal{C}$ shatters $S$. Suppose
$\epsilon < 1/6$; we claim that $\mathcal{F}_{\mathcal{C}}$ $\epsilon$-shatters $S$ with witness $c = (0.5, \ldots, 0.5) \in [0, 1]^n$.
Indeed, let $e \in \{0, 1\}^n$; there exists $A \in \mathcal{C}$ such that

\[ \chi_A(x_i) = e_i \]

for all $i = 1, \ldots, n$, by Proposition 4.2. As a result, since $b(A) \leq 1/3$,

\[ f_A(x_i) = 1 - b(A) > 0.5 + \epsilon \quad \text{for } e_i = 1 \]

and

\[ f_A(x_i) = b(A) < 0.5 - \epsilon \quad \text{for } e_i = 0. \]

Since $n$ was arbitrary, $\mathcal{F}_{\mathcal{C}}$ has infinite Fat Shattering dimension of all scales $\epsilon < 1/6$. □
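
As a small sanity check of the construction and of the argument above, the following sketch builds $f_A$ for a finite toy concept class and tests both claims: unique identification from one labeled point, and $\epsilon$-shattering with witness 0.5. The three-point domain, the choice of all subsets as the concept class, and the particular injection `b` are illustrative assumptions, not taken from the report.

```python
from fractions import Fraction
from itertools import combinations, product

# Toy domain and concept class (illustrative assumptions): all subsets of X.
X = (0, 1, 2)
C = [frozenset(s) for r in range(len(X) + 1) for s in combinations(X, r)]

# A hypothetical injection b : C -> [0, 1/3]; here b(A) = index / 24.
b = {A: Fraction(i, 3 * len(C)) for i, A in enumerate(C)}

def f(A, x):
    """f_A(x) = chi_A(x) + (-1)^chi_A(x) * b(A): 1 - b(A) on A, b(A) off A."""
    chi = 1 if x in A else 0
    return chi + (-1) ** chi * b[A]

def eps_shatters(S, eps, witness=Fraction(1, 2)):
    """True if every 0/1 labelling e of S is realised by some f_A with
    f_A(x_i) > witness + eps when e_i = 1 and f_A(x_i) < witness - eps when e_i = 0."""
    for e in product((0, 1), repeat=len(S)):
        realised = any(
            all(f(A, x) > witness + eps if e_i else f(A, x) < witness - eps
                for x, e_i in zip(S, e))
            for A in C
        )
        if not realised:
            return False
    return True

def identify(x0, value):
    """All concepts A consistent with the single labelled point (x0, f_A(x0))."""
    return [A for A in C if f(A, x0) == value]

shattered = eps_shatters(X, Fraction(1, 8))  # 1/8 < 1/6
identifiable = all(identify(x, f(A, x)) == [A] for A in C for x in X)
print(shattered, identifiable)
```

Exact rational arithmetic (`Fraction`) is used so that the equality test in `identify` is not affected by rounding.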

The next section explains the main result of our research: bounding the Fat
Shattering dimension of scale $\epsilon$ of a composition function class which is built with a
continuous logic connective.

6 The Fat Shattering Dimension of a Composition
Function Class



The goals of this section are to construct a new function class from old ones by means
of a continuous logic connective and to bound the Fat Shattering dimension of scale
$\epsilon$ of the new function class in terms of the dimensions of the old ones. The following
subsection provides this construction, which can be found in Chapter 4 of [18], in the
context of concept classes using a connective of classical logic.

6.1 Construction in the context of concept classes 

Let $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_k$ be concept classes, where $k \geq 2$, and let $u : \{0, 1\}^k \to \{0, 1\}$ be
any function, commonly known as a connective of classical logic. A new collection of
subsets of $X$ arises from $\mathcal{C}_1, \ldots, \mathcal{C}_k$ as follows.

As mentioned earlier in this report, every element $A \in \mathcal{C}_i$ can be identified with
a binary function $f : X \to \{0, 1\}$, namely its characteristic function $f = \chi_A$, and
vice versa. Now, for any $k$ functions $f_1, \ldots, f_k : X \to \{0, 1\}$, where $f_i \in \mathcal{C}_i$ with
$i = 1, \ldots, k$, consider a new function $u(f_1, \ldots, f_k) : X \to \{0, 1\}$ defined by

\[ u(f_1, \ldots, f_k)(x) = u(f_1(x), \ldots, f_k(x)). \]

The set of all possible $u(f_1, \ldots, f_k)$, denoted by $u(\mathcal{C}_1, \ldots, \mathcal{C}_k)$, is given by

\[ u(\mathcal{C}_1, \ldots, \mathcal{C}_k) = \{u(f_1, \ldots, f_k) : f_i \in \mathcal{C}_i\}. \]

For instance, when $k = 2$, we can consider the "Exclusive Or" connective
$\oplus : \{0, 1\}^2 \to \{0, 1\}$ defined by

\[ p \oplus q = (p \wedge \neg q) \vee (\neg p \wedge q), \]

which corresponds to the symmetric difference operation. Then, our new concept
class constructed from $\mathcal{C}_1$ and $\mathcal{C}_2$ is

\[ \{A_1 \,\triangle\, A_2 : A_1 \in \mathcal{C}_1, A_2 \in \mathcal{C}_2\}. \]
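
The construction can be sketched on a finite domain as follows. The sketch applies a connective pointwise to characteristic functions and checks that, for "Exclusive Or", the result is the class of symmetric differences. The domain and the two concept classes (initial segments and singletons) are hypothetical choices made only for illustration.

```python
from itertools import product

X = frozenset(range(4))  # toy domain (assumption)

def compose(u, classes, domain):
    """u(C_1, ..., C_k): apply the connective u pointwise to the characteristic
    functions of every tuple (A_1, ..., A_k) and collect the resulting subsets."""
    composed = set()
    for members in product(*classes):
        composed.add(frozenset(
            x for x in domain
            if u(*(1 if x in A else 0 for A in members)) == 1
        ))
    return composed

def xor(p, q):
    """Exclusive Or on {0, 1}: (p and not q) or (not p and q)."""
    return int((p and not q) or (not p and q))

# Two hypothetical concept classes on X: initial segments and singletons.
C1 = [frozenset(range(n)) for n in range(5)]
C2 = [frozenset({x}) for x in X]

composed = compose(xor, [C1, C2], X)
symmetric_differences = {A1 ^ A2 for A1 in C1 for A2 in C2}
print(composed == symmetric_differences)
```

Representing each concept as a `frozenset` makes duplicates collapse automatically, so `composed` is a set of subsets rather than a list of function tuples.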

The next theorem states that if $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_k$ all have finite VC dimension to
start with, then regardless of $u$, the new collection $u(\mathcal{C}_1, \ldots, \mathcal{C}_k)$ always has finite VC
dimension.

Theorem 6.1 ([18]). Let $k \geq 2$. Suppose $\mathcal{C}_1, \ldots, \mathcal{C}_k$ are concept classes, each viewed
as a collection of binary functions, and $u : \{0, 1\}^k \to \{0, 1\}$ is any function. If the VC
dimension of $\mathcal{C}_i$ is finite for all $i = 1, \ldots, k$, then there exists a constant $\alpha = \alpha_k$,
which depends only on $k$, such that

\[ \mathrm{VC}(u(\mathcal{C}_1, \ldots, \mathcal{C}_k)) \leq \alpha_k d, \quad \text{where } d = \max_{i=1}^{k} \mathrm{VC}(\mathcal{C}_i). \]

The proof of this theorem can be found in [18] and uses Sauer's Lemma to bound
the VC dimension of $u(\mathcal{C}_1, \ldots, \mathcal{C}_k)$. The main objective of our project was to generalize
this theorem for function classes, in terms of the Fat Shattering dimension of scale
$\epsilon$, but the connective of classical logic $u$ would have to be replaced by a continuous
logic connective, a continuous function $u : [0, 1]^k \to [0, 1]$.

6.2 Construction of new function class with continuous logic 
connective 

In first-order logic, there are only two truth-values, 0 or 1, so a connective is a function
$\{0, 1\}^k \to \{0, 1\}$ in the classical sense. However, in continuous logic, truth-values
can be found anywhere in the unit interval $[0, 1]$. Therefore, we should consider
a function $u : [0, 1]^k \to [0, 1]$, which will transform function classes, and require
that $u$ be a continuous logic connective. In other words, $u$ should be continuous
from the (product) metric space $[0, 1]^k$ to the unit interval [19]; in fact, because $u$
is continuous from a compact metric space to a metric space, it is automatically
uniformly continuous.

The following provides the definition of a uniformly continuous function $u$ from
any metric space to another, but we must first qualify $u$ with a modulus of uniform
continuity.

Definition 6.2 (See e.g. ^1^). A modulus of uniform continuity is any function
$\delta : (0, 1] \to (0, 1]$.

Definition 6.3 (See e.g. [H]). Let $(M_1, d_1)$ and $(M_2, d_2)$ be two metric spaces.
A function $u : M_1 \to M_2$ is uniformly continuous if there exists (a modulus of
uniform continuity) $\delta : (0, 1] \to (0, 1]$ such that for all $\epsilon \in (0, 1]$ and $m_1, m_2 \in M_1$, if
$d_1(m_1, m_2) < \delta(\epsilon)$, then $d_2(u(m_1), u(m_2)) < \epsilon$.

Such a $\delta$ is called a modulus of uniform continuity for $u$.

In particular, $u : [0, 1]^k \to [0, 1]$, where $[0, 1]^k$ is equipped with the $L_2$ product
distance $d^2$, is uniformly continuous with modulus of uniform continuity $\delta$ if for every
$\epsilon \in (0, 1]$ and for every $(r_1, \ldots, r_k), (r'_1, \ldots, r'_k) \in [0, 1]^k$,

\[ d^2((r_1, \ldots, r_k), (r'_1, \ldots, r'_k)) < \delta(\epsilon) \implies |u(r_1, \ldots, r_k) - u(r'_1, \ldots, r'_k)| < \epsilon. \]

(Footnote to Theorem 6.1: more specifically, $\alpha = \alpha_k$ is the smallest integer such that
$k < \frac{\alpha}{\log(e\alpha)}$.)

Given function classes $\mathcal{F}_1, \ldots, \mathcal{F}_k$ and a uniformly continuous function $u : [0, 1]^k \to [0, 1]$,
consider the new function class $u(\mathcal{F}_1, \ldots, \mathcal{F}_k)$ defined by

\[ u(\mathcal{F}_1, \ldots, \mathcal{F}_k) = \{u(f_1, \ldots, f_k) : f_i \in \mathcal{F}_i\}, \]

where $u(f_1, \ldots, f_k)(x) = u(f_1(x), \ldots, f_k(x))$ for all $x \in X$, just as in Section 6.1
for concept classes, with $f_i \in \mathcal{F}_i$ and $i = 1, \ldots, k$. Our main result states that the
Fat Shattering dimension of scale $\epsilon$ of $u(\mathcal{F}_1, \ldots, \mathcal{F}_k)$ is bounded by a sum of the Fat
Shattering dimensions of scale $\delta(\epsilon, k)$ of $\mathcal{F}_1, \ldots, \mathcal{F}_k$, where $\delta(\epsilon, k)$ is a function of the
modulus of uniform continuity $\delta(\epsilon)$ for $u$ and of $k$. It is a known result, seen in Chapter
5 of [18], that this new class $u(\mathcal{F}_1, \ldots, \mathcal{F}_k)$ has finite Fat Shattering dimension of all
scales $\epsilon > 0$ (and thus, it is distribution-free PAC learnable) if each of $\mathcal{F}_1, \ldots, \mathcal{F}_k$ has
finite Fat Shattering dimension of all scales, but no bounds were known.

6.3 Main Result 

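Fix $k \geq 2$. The following theorem is our main new result.

Theorem 6.4. Let $\epsilon > 0$, let $\mathcal{F}_1, \ldots, \mathcal{F}_k$ be function classes of $X$, and let $u : [0, 1]^k \to [0, 1]$
be a uniformly continuous function with modulus of continuity $\delta(\epsilon)$. Then

\[ \mathrm{fat}_{\epsilon}(u(\mathcal{F}_1, \ldots, \mathcal{F}_k)) \leq \left( \frac{K \log\left( \frac{4 c' k \sqrt{k}}{\delta(\epsilon/(2c'))\,\epsilon} \right)}{K' \log(2)} \right) \sum_{i=1}^{k} \mathrm{fat}_{\frac{c\,\delta(\epsilon/(2c'))\,\epsilon}{2 c' k \sqrt{k}}}(\mathcal{F}_i), \]

where $c, c', K, K'$ are some absolute constants.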

Extracting the actual values of these absolute constants is not easy, and we hope
to find them in future research. For this reason, comparing the bound in Theorem
6.4 with the existing estimate for the VC dimension of a composition concept class is
difficult; however, in statistical learning theory, estimates for function class learning
are generally much worse than estimates for concept class learning.

In order to prove Theorem 6.4, for clarity, we first introduce an auxiliary function
$\phi : \mathcal{F}_1 \times \ldots \times \mathcal{F}_k \to [0, 1]^X$, which is uniformly continuous from the metric space
$\mathcal{F}_1 \times \ldots \times \mathcal{F}_k$ with the $L_2$ product distance $d^2$ to the metric space $[0, 1]^X$ with distance
induced by the $L_2(\mu)$ norm, and prove the following lemma.

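Lemma 6.5. Let $\epsilon > 0$, let $\mathcal{F}_1, \ldots, \mathcal{F}_k$ be function classes of $X$, and let
$\phi : \mathcal{F}_1 \times \ldots \times \mathcal{F}_k \to [0, 1]^X$ be uniformly continuous with some modulus of continuity $\delta(\epsilon, k)$, a function
of $\epsilon$ and $k$. Then

\[ \mathrm{fat}_{c'\epsilon}(\phi(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k)) \leq \left( \frac{K \log(2\sqrt{k}/\delta(\epsilon, k))}{K' \log(2)} \right) \sum_{i=1}^{k} \mathrm{fat}_{\frac{c\,\delta(\epsilon, k)}{\sqrt{k}}}(\mathcal{F}_i), \]

where $c, c', K, K'$ are some absolute constants and the symbol $\phi(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k)$ simply
represents the image of $\phi$.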

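Then, we will relate the two uniformly continuous functions $u$ and $\phi$.

Lemma 6.6. Let $\epsilon > 0$. If $u : [0, 1]^k \to [0, 1]$ is uniformly continuous with modulus
of continuity $\delta(\epsilon)$, then the function $\phi : \mathcal{F}_1 \times \ldots \times \mathcal{F}_k \to [0, 1]^X$ defined by

\[ \phi(f_1, \ldots, f_k)(x) = u(f_1(x), \ldots, f_k(x)) \]

is also uniformly continuous with modulus of continuity $\frac{\delta(\epsilon/2)\,\epsilon}{2k}$. Note that, by definition,
the image $\phi(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k)$ is exactly $u(\mathcal{F}_1, \ldots, \mathcal{F}_k)$.

6.4 Proofs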

In order to prove Lemma 6.5, we first introduce the concept of an $\epsilon$-covering number
for any metric space, based on [9], and relate this number for a function class to its
Fat Shattering dimension of scale $\epsilon$ by using results from Mendelson and Vershynin
[9] and Talagrand [15].

Definition 6.7. Let $\epsilon > 0$ and suppose $(M, d)$ is a metric space. The $\epsilon$-covering
number, denoted by $N(M, \epsilon, d)$, of $M$ is the minimal number $N$ such that there exist
elements $m_1, m_2, \ldots, m_N \in M$ with the property that for all $m \in M$, there exists
$i \in \{1, 2, \ldots, N\}$ for which

\[ d(m, m_i) < \epsilon. \]

The set $\{m_1, m_2, \ldots, m_N\}$ is called a (minimal) $\epsilon$-net of $M$.
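
To make the definition concrete, here is a minimal sketch that builds an $\epsilon$-net of a finite metric space greedily. The greedy net is a valid $\epsilon$-net, so its size is an upper bound on $N(M, \epsilon, d)$, but it is not guaranteed to be minimal. The integer grid, the scale and the function names are assumptions chosen for illustration.

```python
def greedy_net(points, eps, dist):
    """Return centers such that every point is at distance < eps from a center.
    A point becomes a new center only if it is at distance >= eps from all
    centers chosen so far, so the result is an eps-net (not necessarily minimal)."""
    centers = []
    for p in points:
        if all(dist(p, c) >= eps for c in centers):
            centers.append(p)
    return centers

def is_net(centers, points, eps, dist):
    """Check the defining property of an eps-net from Definition 6.7."""
    return all(any(dist(p, c) < eps for c in centers) for p in points)

def absdist(a, b):
    return abs(a - b)

# Hypothetical finite metric space: the integers 0..100 with the usual distance.
M = list(range(101))
net = greedy_net(M, 10, absdist)
print(net, is_net(net, M, 10, absdist))
```

Integer points are used so that the comparisons with the scale are exact.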

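The following proposition relates the $\epsilon$-covering number of a product of metric
spaces $M_1 \times \ldots \times M_k$, with the $L_2$ product distance $d^2$, to the $\frac{\epsilon}{\sqrt{k}}$-covering number
of each space $M_i$.

Proposition 6.8. Let $\epsilon > 0$ and suppose $(M_1, d_1), \ldots, (M_k, d_k)$ are metric spaces,
each with finite $\frac{\epsilon}{\sqrt{k}}$-covering numbers, $N_i = N(M_i, \frac{\epsilon}{\sqrt{k}}, d_i)$ for $i = 1, \ldots, k$. Then

\[ N(M_1 \times \ldots \times M_k, \epsilon, d^2) \leq \prod_{i=1}^{k} N_i. \]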

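Proof. Let $C_i = \{a^i_1, \ldots, a^i_{N_i}\}$ be a minimal $\frac{\epsilon}{\sqrt{k}}$-net for $M_i$ with respect to distance
$d_i$, where $i = 1, \ldots, k$, and suppose $(a^1, \ldots, a^k) \in M_1 \times \ldots \times M_k$. Then, for each
$i = 1, \ldots, k$, there exists $a^i_{j_i} \in C_i$, where $1 \leq j_i \leq N_i$, such that $d_i(a^i, a^i_{j_i}) < \frac{\epsilon}{\sqrt{k}}$.
Hence,

\begin{align*}
d^2((a^1, \ldots, a^k), (a^1_{j_1}, \ldots, a^k_{j_k})) &= \sqrt{(d_1(a^1, a^1_{j_1}))^2 + \ldots + (d_k(a^k, a^k_{j_k}))^2} \\
&< \sqrt{k \left( \frac{\epsilon}{\sqrt{k}} \right)^2} \\
&= \epsilon,
\end{align*}

where each $(a^1_{j_1}, \ldots, a^k_{j_k}) \in C_1 \times \ldots \times C_k$, which has cardinality $\prod_{i=1}^{k} N_i$. Therefore,
$N(M_1 \times \ldots \times M_k, \epsilon, d^2) \leq \prod_{i=1}^{k} N_i$. □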
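
The argument can be checked numerically on a small example: nets of the factors at scale $\epsilon/\sqrt{k}$ are combined into a Cartesian product, which is then verified to be an $\epsilon$-net of the product space. The two grids, the hand-picked component nets and the helper names are illustrative assumptions.

```python
import math
from itertools import product

def is_net(centers, points, eps, dist):
    """True if every point lies at distance < eps from some center."""
    return all(any(dist(p, c) < eps for c in centers) for p in points)

def l2_product_distance(dists):
    """L2 product distance built from component distances d_1, ..., d_k."""
    def d2(a, b):
        return math.sqrt(sum(d(x, y) ** 2 for d, x, y in zip(dists, a, b)))
    return d2

def absdist(x, y):
    return abs(x - y)

# Hypothetical component spaces: two finite grids in [0, 1].
M1 = [i / 20 for i in range(21)]
M2 = [i / 10 for i in range(11)]
k, eps = 2, 0.5

# Component nets at scale eps / sqrt(k), roughly 0.354 (chosen by hand).
C1 = [0.0, 0.5, 1.0]
C2 = [0.0, 0.5, 1.0]
component_ok = (is_net(C1, M1, eps / math.sqrt(k), absdist)
                and is_net(C2, M2, eps / math.sqrt(k), absdist))

# The Cartesian product of the component nets should be an eps-net of M1 x M2.
d2 = l2_product_distance([absdist, absdist])
product_net = list(product(C1, C2))
product_ok = is_net(product_net, list(product(M1, M2)), eps, d2)
print(component_ok, product_ok, len(product_net))
```

The size of `product_net` is `len(C1) * len(C2)`, matching the product bound in Proposition 6.8.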

Also, if $u : M_1 \to M_2$ is any uniformly continuous function with a modulus of
uniform continuity $\delta(\epsilon)$ from any metric space to another, then the image of a minimal
$\delta(\epsilon)$-net of $M_1$ under $u$ becomes an $\epsilon$-net for $u(M_1)$.

Proposition 6.9. Let $\epsilon > 0$ and suppose $(M_1, d_1)$ and $(M_2, d_2)$ are two metric spaces.
If a function $u : M_1 \to M_2$ is uniformly continuous with a modulus of continuity $\delta(\epsilon)$,
then $N(u(M_1), \epsilon, d_2) \leq N(M_1, \delta(\epsilon), d_1)$, where $u(M_1)$ denotes the image of $u$.

Proof. Suppose $N = N(M_1, \delta(\epsilon), d_1)$ is the $\delta(\epsilon)$-covering number for $M_1$ and let
$\{m_1, \ldots, m_N\}$ be a $\delta(\epsilon)$-net for $M_1$. Hence for every $u(m) \in u(M_1)$, where $m \in M_1$,
there exists $i \in \{1, \ldots, N\}$ such that

\[ d_1(m, m_i) < \delta(\epsilon), \]

which implies $d_2(u(m), u(m_i)) < \epsilon$ as $u$ is uniformly continuous. As a result, the set

\[ \{u(m_1), \ldots, u(m_N)\} \]

is an $\epsilon$-net for $u(M_1)$, so

\[ N(u(M_1), \epsilon, d_2) \leq N(M_1, \delta(\epsilon), d_1). \]

□

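In particular, we can view $\mathcal{F}_1, \ldots, \mathcal{F}_k$ as metric spaces, all with distances induced
by the $L_2(\mu)$ norm, and suppose $\phi : \mathcal{F}_1 \times \ldots \times \mathcal{F}_k \to [0, 1]^X$ is uniformly continuous
with modulus of continuity $\delta(\epsilon, k)$. Then, by Proposition 6.8, if $\mathcal{F}_1, \ldots, \mathcal{F}_k$ all have
finite $\frac{\delta(\epsilon, k)}{\sqrt{k}}$-covering numbers, the metric space $\mathcal{F}_1 \times \ldots \times \mathcal{F}_k$, with the $L_2$ product
metric $d^2$, also has a finite $\delta(\epsilon, k)$-covering number: if we write $N(\mathcal{F}_i, \frac{\delta(\epsilon, k)}{\sqrt{k}}, L_2(\mu))$ as
the $\frac{\delta(\epsilon, k)}{\sqrt{k}}$-covering number for $\mathcal{F}_i$, then

\[ N(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k, \delta(\epsilon, k), d^2) \leq \prod_{i=1}^{k} N\left(\mathcal{F}_i, \frac{\delta(\epsilon, k)}{\sqrt{k}}, L_2(\mu)\right). \]

Now, by Proposition 6.9,

\begin{align*}
N(\phi(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k), \epsilon, L_2(\mu)) &\leq N(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k, \delta(\epsilon, k), d^2) \\
&\leq \prod_{i=1}^{k} N\left(\mathcal{F}_i, \frac{\delta(\epsilon, k)}{\sqrt{k}}, L_2(\mu)\right).
\end{align*}

In other words, the $\epsilon$-covering number for $\phi(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k)$ is bounded by a product
of the $\frac{\delta(\epsilon, k)}{\sqrt{k}}$-covering numbers of each $\mathcal{F}_i$. To prove Lemma 6.5, we now state the main
theorem of a paper written by Mendelson and Vershynin, which relates the $\epsilon$-covering
number of a function class to its Fat Shattering dimension of scale $\epsilon$.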

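Theorem 6.10 ([9]). Let $\epsilon > 0$ and let $\mathcal{F}$ be a function class. Then for every
probability measure $\mu$,

\[ N(\mathcal{F}, \epsilon, L_2(\mu)) \leq \left( \frac{2}{\epsilon} \right)^{K\,\mathrm{fat}_{c\epsilon}(\mathcal{F})} \]

for absolute constants $c, K$.

And Talagrand provides the converse.

Theorem 6.11 ([15]). Following the notations of Theorem 6.10, there exists a probability
measure $\mu$ such that

\[ N(\mathcal{F}, \epsilon, L_2(\mu)) \geq 2^{K'\,\mathrm{fat}_{c'\epsilon}(\mathcal{F})} \]

for absolute constants $c', K'$.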

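Proof of Lemma 6.5. By Propositions 6.8 and 6.9,

\[ N(\phi(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k), \epsilon, L_2(\mu)) \leq \prod_{i=1}^{k} N\left(\mathcal{F}_i, \frac{\delta(\epsilon, k)}{\sqrt{k}}, L_2(\mu)\right), \]

so

\[ \log(N(\phi(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k), \epsilon, L_2(\mu))) \leq \sum_{i=1}^{k} \log\left(N\left(\mathcal{F}_i, \frac{\delta(\epsilon, k)}{\sqrt{k}}, L_2(\mu)\right)\right). \]

By Theorem 6.10,

\[ \log N\left(\mathcal{F}_i, \frac{\delta(\epsilon, k)}{\sqrt{k}}, L_2(\mu)\right) \leq K\,\mathrm{fat}_{\frac{c\,\delta(\epsilon, k)}{\sqrt{k}}}(\mathcal{F}_i) \log(2\sqrt{k}/\delta(\epsilon, k)), \]

for any probability measure $\mu$, where $c, K$ are absolute constants. Moreover, by Theorem
6.11, for some probability measure $\mu$ and absolute constants $c', K'$,

\[ \log(N(\phi(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k), \epsilon, L_2(\mu))) \geq K'\,\mathrm{fat}_{c'\epsilon}(\phi(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k)) \log(2) \]

and altogether,

\begin{align*}
\mathrm{fat}_{c'\epsilon}(\phi(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k)) &\leq \frac{\sum_{i=1}^{k} K\,\mathrm{fat}_{\frac{c\,\delta(\epsilon, k)}{\sqrt{k}}}(\mathcal{F}_i) \log(2\sqrt{k}/\delta(\epsilon, k))}{K' \log(2)} \\
&= \left( \frac{K \log(2\sqrt{k}/\delta(\epsilon, k))}{K' \log(2)} \right) \sum_{i=1}^{k} \mathrm{fat}_{\frac{c\,\delta(\epsilon, k)}{\sqrt{k}}}(\mathcal{F}_i).
\end{align*}

□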



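Now, all that is left is to prove Lemma 6.6.

Proof of Lemma 6.6. Suppose $u : [0, 1]^k \to [0, 1]$ is uniformly continuous with a modulus
of continuity $\delta(\epsilon)$, where $[0, 1]^k$ is a metric space with the $L_2$ product distance
$d^2$. We claim that the function $\phi : \mathcal{F}_1 \times \ldots \times \mathcal{F}_k \to [0, 1]^X$ defined by

\[ \phi(f_1, \ldots, f_k)(x) = u(f_1(x), \ldots, f_k(x)) \]

is uniformly continuous with modulus of continuity $\frac{\delta(\epsilon/2)\,\epsilon}{2k}$. Let $\epsilon > 0$ and

\[ (f_1, \ldots, f_k), (f'_1, \ldots, f'_k) \in \mathcal{F}_1 \times \ldots \times \mathcal{F}_k. \]

Suppose

\[ d^2((f_1, \ldots, f_k), (f'_1, \ldots, f'_k)) < \frac{\delta(\epsilon/2)\,\epsilon}{2k} = \sqrt{\frac{\delta(\epsilon/2)^2 (\epsilon/2)^2}{k^2}}. \]

Hence, for each $i = 1, \ldots, k$,

\[ \int_X (f_i(x) - f'_i(x))^2 \, d\mu(x) < \frac{\delta(\epsilon/2)^2 (\epsilon/2)^2}{k^2}. \]

Write $A_i = \{x \in X : |f_i(x) - f'_i(x)| \geq \frac{\delta(\epsilon/2)}{\sqrt{k}}\}$ and we must have that $\mu(A_i) < \frac{(\epsilon/2)^2}{k}$
for each $i = 1, \ldots, k$. Otherwise,

\begin{align*}
\int_X (f_i(x) - f'_i(x))^2 \, d\mu(x) &= \int_{A_i} (f_i(x) - f'_i(x))^2 \, d\mu(x) + \int_{X \setminus A_i} (f_i(x) - f'_i(x))^2 \, d\mu(x) \\
&\geq \int_{A_i} (f_i(x) - f'_i(x))^2 \, d\mu(x) \\
&\geq \frac{\delta(\epsilon/2)^2}{k} \mu(A_i) \\
&\geq \frac{\delta(\epsilon/2)^2}{k} \cdot \frac{(\epsilon/2)^2}{k} \\
&= \frac{\delta(\epsilon/2)^2 (\epsilon/2)^2}{k^2},
\end{align*}

which is a contradiction. Now, write $A = A_1 \cup \ldots \cup A_k$ and we have that

\[ X \setminus A = \{x \in X : |f_i(x) - f'_i(x)| < \tfrac{\delta(\epsilon/2)}{\sqrt{k}} \text{ for all } i = 1, \ldots, k\}. \]

Suppose $x \in X \setminus A$ and then

\begin{align*}
d^2((f_1(x), \ldots, f_k(x)), (f'_1(x), \ldots, f'_k(x))) &= \sqrt{|f_1(x) - f'_1(x)|^2 + \ldots + |f_k(x) - f'_k(x)|^2} \\
&< \delta(\epsilon/2).
\end{align*}

Consequently, by the uniform continuity of $u$, for all $x \in X \setminus A$,

\[ |u(f_1(x), \ldots, f_k(x)) - u(f'_1(x), \ldots, f'_k(x))| < \epsilon/2. \]

Finally,

\begin{align*}
\|\phi(f_1, \ldots, f_k) - \phi(f'_1, \ldots, f'_k)\|_2 &= \sqrt{\int_X (u(f_1(x), \ldots, f_k(x)) - u(f'_1(x), \ldots, f'_k(x)))^2 \, d\mu(x)} \\
&= \sqrt{\int_{X \setminus A} (\ldots)^2 \, d\mu(x) + \int_A (\ldots)^2 \, d\mu(x)} \\
&\leq \sqrt{\int_{X \setminus A} (\epsilon/2)^2 \, d\mu(x)} + \sqrt{\int_A 1 \, d\mu(x)} \\
&< (\epsilon/2) + (\epsilon/2) = \epsilon,
\end{align*}

where $(\ldots)$ abbreviates the integrand of the first line, as $\mu(A) \leq \sum_{i=1}^{k} \mu(A_i) < k \left( \frac{(\epsilon/2)^2}{k} \right) = (\epsilon/2)^2$. □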

Now we will prove our main theorem.

Proof of Theorem 6.4. By Lemma 6.6, if $u : [0, 1]^k \to [0, 1]$ is uniformly continuous
with modulus of continuity $\delta(\epsilon)$, then $\phi : \mathcal{F}_1 \times \ldots \times \mathcal{F}_k \to [0, 1]^X$ defined by

\[ \phi(f_1, \ldots, f_k)(x) = u(f_1(x), \ldots, f_k(x)) \]

is also uniformly continuous with modulus of continuity $\frac{\delta(\epsilon/2)\,\epsilon}{2k}$. Then, apply Lemma
6.5 with $\delta(\epsilon, k) = \frac{\delta(\epsilon/2)\,\epsilon}{2k}$ and with a simple change of variables $c'\epsilon \to \epsilon$. Theorem 6.4
follows directly. □
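
The change of variables in the last step can be spelled out as follows. This is a sketch that uses only the statements of Lemmas 6.5 and 6.6, together with the fact that the image of $\phi$ is $u(\mathcal{F}_1, \ldots, \mathcal{F}_k)$ by definition; the absolute constants $c, c', K, K'$ remain unspecified.

```latex
% Lemma 6.5 with \delta(\epsilon, k) = \delta(\epsilon/2)\,\epsilon/(2k) from Lemma 6.6,
% so that 2\sqrt{k}/\delta(\epsilon, k) = 4k\sqrt{k}/(\delta(\epsilon/2)\,\epsilon):
\[
\mathrm{fat}_{c'\epsilon}\bigl(u(\mathcal{F}_1,\ldots,\mathcal{F}_k)\bigr)
\le \frac{K \log\!\bigl(\frac{4k\sqrt{k}}{\delta(\epsilon/2)\,\epsilon}\bigr)}{K'\log 2}
\sum_{i=1}^{k} \mathrm{fat}_{\frac{c\,\delta(\epsilon/2)\,\epsilon}{2k\sqrt{k}}}(\mathcal{F}_i).
\]
% Replacing \epsilon by \epsilon/c' throughout:
\[
\mathrm{fat}_{\epsilon}\bigl(u(\mathcal{F}_1,\ldots,\mathcal{F}_k)\bigr)
\le \frac{K \log\!\bigl(\frac{4c'k\sqrt{k}}{\delta(\epsilon/(2c'))\,\epsilon}\bigr)}{K'\log 2}
\sum_{i=1}^{k} \mathrm{fat}_{\frac{c\,\delta(\epsilon/(2c'))\,\epsilon}{2c'k\sqrt{k}}}(\mathcal{F}_i).
\]
```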

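Altogether, we can summarize the maps in this section in the following two diagrams
(where $i$ is the diagonal map):

\[ X \xrightarrow{\; i \;} X^k \xrightarrow{\; f_1 \times \ldots \times f_k \;} [0, 1]^k \xrightarrow{\; u \;} [0, 1], \]

while

\[ \mathcal{F}_1 \times \ldots \times \mathcal{F}_k \xrightarrow{\; \phi \;} [0, 1]^X. \]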

This result is potentially useful because it allows us to construct new function
classes using common continuous logic connectives and bound their Fat Shattering
dimensions of scale $\epsilon$. For instance, the function $u : [0, 1]^2 \to [0, 1]$ defined by
$u(r_1, r_2) = r_1 \cdot r_2$ (multiplication) is uniformly continuous with a modulus of continuity
$\delta(\epsilon) = \frac{\epsilon}{2}$. Indeed, let $\epsilon > 0$ and consider $(r_1, r_2), (r'_1, r'_2) \in [0, 1]^2$. Suppose
$d^2((r_1, r_2), (r'_1, r'_2)) < \delta(\epsilon) = \frac{\epsilon}{2}$, so

\[ |r_1 - r'_1| \leq \sqrt{|r_1 - r'_1|^2 + |r_2 - r'_2|^2} < \frac{\epsilon}{2} \]

and similarly, $|r_2 - r'_2| < \frac{\epsilon}{2}$. Then,

\begin{align*}
|u(r_1, r_2) - u(r'_1, r'_2)| &= |r_1 r_2 - r'_1 r'_2| \\
&= |r_1 r_2 - r_1 r'_2 + r_1 r'_2 - r'_1 r'_2| \\
&= |r_1 (r_2 - r'_2) + r'_2 (r_1 - r'_1)| \\
&\leq |r_1 (r_2 - r'_2)| + |r'_2 (r_1 - r'_1)| \\
&\leq |r_2 - r'_2| + |r_1 - r'_1| \\
&< \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon.
\end{align*}

As a result, if $\mathcal{F}_1$ and $\mathcal{F}_2$ are two function classes with finite Fat Shattering
dimensions of some scale $\epsilon$, then the function class $u(\mathcal{F}_1, \mathcal{F}_2) = \{f_1 \cdot f_2 :
f_1 \in \mathcal{F}_1, f_2 \in \mathcal{F}_2\}$, defined by point-wise multiplication, also has finite Fat Shattering
dimension of scale $\epsilon$, up to some constant factor, and Theorem 6.4 provides a precise
bound.
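
The modulus $\delta(\epsilon) = \epsilon/2$ for the multiplication connective can also be probed numerically. The sketch below searches a grid in $[0, 1]^2 \times [0, 1]^2$ for pairs that are closer than $\delta(\epsilon)$ in the $L_2$ product distance yet whose products differ by at least $\epsilon$. A grid search is only evidence, not a proof, and the grid resolution and tested scales are arbitrary assumptions.

```python
import math
from itertools import product

def u(r1, r2):
    """Multiplication connective on [0, 1]^2."""
    return r1 * r2

def delta(eps):
    """Candidate modulus of uniform continuity from the text: delta(eps) = eps / 2."""
    return eps / 2

def count_violations(eps, steps=20):
    """Count grid pairs with d((r1, r2), (s1, s2)) < delta(eps)
    but |u(r1, r2) - u(s1, s2)| >= eps."""
    grid = [i / steps for i in range(steps + 1)]
    bad = 0
    for r1, r2, s1, s2 in product(grid, repeat=4):
        close = math.hypot(r1 - s1, r2 - s2) < delta(eps)
        if close and abs(u(r1, r2) - u(s1, s2)) >= eps:
            bad += 1
    return bad

violations = {eps: count_violations(eps) for eps in (0.1, 0.3, 1.0)}
print(violations)
```

No violations are expected, in line with the inequality chain above, which in fact bounds the difference of products by $|r_1 - r'_1| + |r_2 - r'_2|$.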

We have made an interesting connection, which has not been explored much in 
the past, between continuous logic and PAC learning, and we plan to investigate this 
connection even further. For instance, the relationship of compositions of function 
classes and continuous logic may be interesting to study because compositions of 
uniformly continuous functions are again uniformly continuous. Furthermore, we can 
try to add some topological structures to concept classes to see how PAC learning 
can be affected. The next section provides a couple of other possible future research 
topics. 






7 Open Questions 



The definitions of distribution-free PAC learning, for both concept and function
classes, in Section 3 made no assumptions about probability measures, as a learning
algorithm has to produce a valid hypothesis for any probability measure $\mu$. If we fix
a probability measure $\mu$ and ask whether a concept class, or a function class, is PAC
learnable, then we are working in the context of fixed distribution PAC learning.

Definition 7.1 ([H]). Let $\mu$ be a probability measure. A function class $\mathcal{F}$ is Probably
Approximately Correct learnable under $\mu$ if there exists an algorithm
$L : \bigcup_{m \in \mathbb{N}} (X \times [0, 1])^m \to \mathcal{F}$ with the following property: for every $\epsilon > 0$, for every $\delta > 0$, there exists
an $M \in \mathbb{N}$ such that for every $f \in \mathcal{F}$, for every $m \geq M$, for any $x_1, \ldots, x_m \in X$, we
have $E_\mu(H_m, f) < \epsilon$ with confidence at least $1 - \delta$, where $E_\mu(H_m, f)$ denotes the error of
the hypothesis $H_m$ with respect to $f$ under $\mu$ and $H_m = L((x_1, f(x_1)), \ldots, (x_m, f(x_m)))$.

When a function class $\mathcal{F}$ consists of only binary functions, i.e. $\mathcal{F} = \mathcal{C}$ is a
concept class, there is a theorem, proved by Benedek and Itai in 1991, which gives a
characterization of fixed distribution PAC learnability.

Theorem 7.2 (Benedek and Itai). Fix a probability measure $\mu$ and consider a concept class $\mathcal{C}$. The
following are equivalent:

1. $\mathcal{C}$ is Probably Approximately Correct learnable under $\mu$.

2. (Finite Metric Entropy condition) The $\epsilon$-covering number of $\mathcal{C}$ when viewed as
a metric space with distance $d = \mu(\cdot \,\triangle\, \cdot)$ is finite for every $\epsilon > 0$.
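
To illustrate the distance $d = \mu(\cdot \,\triangle\, \cdot)$ used in the Finite Metric Entropy condition, the following sketch computes it for a toy concept class under a uniform measure on a finite domain and builds an $\epsilon$-net greedily, which gives an upper bound on the $\epsilon$-covering number. The domain, the measure, the class of initial segments and the helper names are all hypothetical.

```python
from fractions import Fraction

X = range(20)                            # toy finite domain (assumption)
mu = {x: Fraction(1, 20) for x in X}     # uniform probability measure (assumption)

def d_mu(A, B):
    """d(A, B) = mu(A symmetric-difference B)."""
    return sum((mu[x] for x in A ^ B), Fraction(0))

# Hypothetical concept class: the initial segments {0, ..., n - 1} for n = 0..20.
C = [frozenset(range(n)) for n in range(21)]

def greedy_net(concepts, eps):
    """Greedy eps-net: add a concept only if it is at distance >= eps
    from every center chosen so far (valid, not necessarily minimal)."""
    centers = []
    for A in concepts:
        if all(d_mu(A, B) >= eps for B in centers):
            centers.append(A)
    return centers

eps = Fraction(1, 4)
net = greedy_net(C, eps)
covered = all(any(d_mu(A, B) < eps for B in net) for A in C)
print(len(net), covered)
```

For a finite class the covering numbers are trivially finite; the condition in Theorem 7.2 becomes meaningful for infinite classes, where it must hold at every scale.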

However, there is no characterization for fixed distribution PAC learnability of
a general function class. Talagrand had proved that a function class is a Glivenko-
Cantelli (GC) function class with regard to a single measure $\mu$ if and only if the class
has no witness of irregularity, a property that involves shattering [13], [11]. Every GC
function class is PAC learnable under $\mu$ [11], but the property of having no witness
of irregularity is strictly stronger than PAC learnability. We would like to propose
the following conjecture for a possible characterization.

Conjecture 7.3. Fix a probability measure $\mu$ and consider a function class $\mathcal{F}$. Let
$\epsilon > 0$. The following are equivalent:

1. The function class $\mathcal{F}$ is PAC learnable under $\mu$ to accuracy $\epsilon$.

2. There exist $M$, $N$ and $\gamma > 0$ such that for all functions $f \in \mathcal{F}$, with probability
at least $\gamma$, the set $\{g \in \mathcal{F} : g|_{x_N} = f|_{x_N}\}$ has an $\epsilon$-covering number, with respect
to the distance $d = E_\mu(\cdot, \cdot)$, of at most $M$, where $x_N$ denotes a sample of $N$
points.

(Footnote to item 1: being PAC learnable to accuracy $\epsilon$ means Definition 7.1 is satisfied, but only
for this particular $\epsilon$.)

A very interesting research topic is to study this conjecture and either prove or
disprove it. Also, by Proposition 5.6, the finiteness of the Fat Shattering dimension of
all scales $\epsilon > 0$ does not characterize function class PAC learning in the distribution-
free case; consequently, another topic of research would be to come up with a new
combinatorial parameter for a function class, related to the notion of shattering,
which would characterize learning. This new parameter would have to solve the
problem of unique identifications of functions, a problem that does not occur with
concept classes.

Yet another possible research topic is to generalize the definitions of PAC learning
and introduce observation noise, both in the fixed distribution and distribution-free
cases. The paper [3] written by Bartlett et al. proves that the finiteness of the
Fat Shattering dimension of all scales of a function class $\mathcal{F}$ is equivalent to $\mathcal{F}$ being
distribution-free learnable under certain noise distributions. It would be interesting
to generalize this result and/or apply it in the fixed distribution setting.

8 Conclusion

This report introduces the definitions of Probably Approximately Correct learning
for concept and function classes and defines the Vapnik-Chervonenkis dimension for
concept classes and the Fat Shattering dimension of scale $\epsilon > 0$ for function classes.
Finiteness of the VC dimension characterizes concept class distribution-free PAC
learning; however, the finiteness of the Fat Shattering dimension of all scales $\epsilon$ is still
only sufficient for function class learning, and not necessary.

Given function classes $\mathcal{F}_1, \ldots, \mathcal{F}_k$, one can construct a new class $u(\mathcal{F}_1, \ldots, \mathcal{F}_k)$
using a continuous function $u : [0, 1]^k \to [0, 1]$, a continuous logic connective. The
main new result of this report shows that the Fat Shattering dimension of scale $\epsilon$ of
$u(\mathcal{F}_1, \ldots, \mathcal{F}_k)$ is bounded by a sum of the Fat Shattering dimensions of scale $\delta(\epsilon, k)$ of
classes $\mathcal{F}_1, \ldots, \mathcal{F}_k$, up to some absolute constants. This result can be useful because
it allows us to construct new function classes, which may be very natural objects,
and bound their Fat Shattering dimensions.